Runbook: Keycloak SSO Failure

Alert

Severity

Critical: a Keycloak SSO failure prevents authentication to all platform UIs (Grafana, Harbor, NeuVector). Users cannot log in to the monitoring dashboards or the container registry.

Impact

Investigation Steps

  1. Check Keycloak pod status:
kubectl get pods -n keycloak
  2. Check Keycloak pod logs:
kubectl logs -n keycloak keycloak-0 --tail=200
  3. Check whether the Keycloak database (bundled PostgreSQL) is running:
kubectl get pods -n keycloak -l app.kubernetes.io/component=postgresql
kubectl logs -n keycloak -l app.kubernetes.io/component=postgresql --tail=100
  4. Check the Keycloak HelmRelease:
flux get helmrelease keycloak -n keycloak
  5. Test the Keycloak health endpoint (this assumes curl is present in the Keycloak image; if it is not, query the endpoint through the port-forward in step 6):
kubectl exec -n keycloak keycloak-0 -- curl -s http://localhost:8080/health/ready
  6. Check the OIDC well-known endpoint:
kubectl port-forward -n keycloak svc/keycloak-http 8080:80 &
curl -s http://localhost:8080/realms/sre/.well-known/openid-configuration | jq .
  7. Check Keycloak events for failed logins:
# Reuse the port-forward from step 6. A second port-forward on local port 8080
# fails while the first is still running, so start it only if it has exited:
# kubectl port-forward -n keycloak svc/keycloak-http 8080:80 &
# Get admin token
TOKEN=$(curl -s -X POST "http://localhost:8080/realms/master/protocol/openid-connect/token" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  -d "password=$(kubectl get secret keycloak -n keycloak -o jsonpath='{.data.admin-password}' | base64 -d)" \
  -d "grant_type=password" | jq -r '.access_token')

# Get recent events
curl -s -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/admin/realms/sre/events?type=LOGIN_ERROR&max=20" | jq .
  8. Verify the Istio VirtualService for Keycloak:
kubectl get virtualservice -n keycloak
kubectl describe virtualservice keycloak -n keycloak
  9. Check whether the Keycloak service is reachable from other namespaces (e.g., from the monitoring namespace, where Grafana runs):
kubectl run -n monitoring --rm -it --restart=Never curl-test --image=curlimages/curl:8.4.0 -- curl -s http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/.well-known/openid-configuration
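
The port-forward commands above run in the background and are followed immediately by curl, so the request can fail if the tunnel is not up yet, and the background process keeps running afterwards. The following is a minimal sketch of one way to handle both. `wait_for_url` is a hypothetical helper, not part of kubectl or Keycloak, and the 10-second timeout is an arbitrary assumption.

```shell
# Hypothetical helper: poll a URL once per second until it answers
# successfully or the timeout (in seconds, default 10) expires.
wait_for_url() (
  url="$1"
  timeout="${2:-10}"
  i=0
  while [ "$i" -lt "$timeout" ]; do
    # -f makes curl return non-zero on HTTP errors such as 404 or 503.
    if curl -sf -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
)

# Usage (run manually against the cluster):
#   kubectl port-forward -n keycloak svc/keycloak-http 8080:80 &
#   PF_PID=$!
#   trap 'kill "$PF_PID" 2>/dev/null' EXIT   # stop the tunnel when the shell exits
#   wait_for_url http://localhost:8080/realms/sre/.well-known/openid-configuration 10 \
#     || echo "Keycloak did not answer through the port-forward"
```

Stopping the tunnel through the trap also avoids the "address already in use" error when a later step starts another port-forward on the same local port.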

Resolution

Keycloak pod not starting

  1. Check pod events:
kubectl describe pod keycloak-0 -n keycloak
  2. If the pod is in CrashLoopBackOff, check logs from the previous crash:
kubectl logs -n keycloak keycloak-0 --previous
  3. Common causes:
     - Database connection failure (PostgreSQL not ready)
     - Out of memory
     - Configuration error after upgrade
  4. If the database is not ready, restart it first:
kubectl rollout restart statefulset -n keycloak -l app.kubernetes.io/component=postgresql
  5. Then restart Keycloak:
kubectl delete pod keycloak-0 -n keycloak
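
To separate an out-of-memory kill from an application crash, the container's last termination reason can be read from the pod status. The sketch below assumes Keycloak is the first container in the pod. `explain_reason` is a hypothetical helper, and its mapping covers only the causes listed above.

```shell
# Hypothetical helper: map a container termination reason to a hint.
explain_reason() (
  case "$1" in
    OOMKilled) echo "out of memory: review the container memory limit" ;;
    Error)     echo "application crash: check the previous logs for database or configuration errors" ;;
    "")        echo "no previous termination recorded" ;;
    *)         echo "unrecognized reason: $1" ;;
  esac
)

# Usage (run manually against the cluster):
#   reason=$(kubectl get pod keycloak-0 -n keycloak \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   explain_reason "$reason"
```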

OIDC token endpoint unreachable from Grafana

  1. Grafana reaches Keycloak via the internal service URL. Verify that the service exists:
kubectl get svc keycloak-http -n keycloak
  2. Check that the Grafana OIDC configuration points to the correct URL:
kubectl get helmrelease kube-prometheus-stack -n monitoring -o yaml | grep -A 5 "token_url"

The token URL should be: http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/protocol/openid-connect/token

  3. If the URL is incorrect, update the monitoring HelmRelease values in Git.
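
The comparison against the expected token URL can be scripted rather than read by eye. This is a sketch only: `check_token_url` is a hypothetical helper, and it assumes the value appears in the HelmRelease YAML as a single `token_url: <value>` line.

```shell
EXPECTED_TOKEN_URL="http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/protocol/openid-connect/token"

# Hypothetical helper: read HelmRelease YAML on stdin, extract the first
# token_url value, and compare it with the expected URL.
check_token_url() (
  actual=$(sed -n 's/^[[:space:]]*token_url:[[:space:]]*//p' | head -n 1 | tr -d "\"'")
  if [ "$actual" = "$EXPECTED_TOKEN_URL" ]; then
    echo "token_url OK"
  else
    echo "token_url mismatch: found '$actual'"
    return 1
  fi
)

# Usage (run manually against the cluster):
#   kubectl get helmrelease kube-prometheus-stack -n monitoring -o yaml | check_token_url
```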

SSO redirect loop

  1. This usually indicates a mismatch between the Keycloak hostname and the redirect URL.
  2. Check the Keycloak hostname configuration:
kubectl get pod keycloak-0 -n keycloak -o yaml | grep -A 2 "KC_HOSTNAME"
  3. Verify that KC_HOSTNAME matches the external URL used by clients (e.g., keycloak.apps.sre.example.com).
  4. Check that KC_HOSTNAME_PORT is set correctly (it should be 443 for HTTPS via the Istio gateway).
  5. Verify that the OIDC client redirect URIs in Keycloak match the actual application URLs.
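
The hostname comparison above can be sketched as a small check. `url_host` and `hostname_matches` are hypothetical helpers. The usage lines assume KC_HOSTNAME is set as a literal env value on the first container of the pod, which may not hold if the chart injects it from a ConfigMap or Secret.

```shell
# Hypothetical helper: print the host part of a URL or bare hostname
# (scheme, path, and port removed).
url_host() (
  rest="${1#*://}"            # drop the scheme, if any
  rest="${rest%%/*}"          # drop the path
  printf '%s\n' "${rest%%:*}" # drop the port
)

# Hypothetical helper: succeed when both arguments refer to the same host.
hostname_matches() (
  [ "$(url_host "$1")" = "$(url_host "$2")" ]
)

# Usage (run manually against the cluster):
#   kc_hostname=$(kubectl get pod keycloak-0 -n keycloak \
#     -o jsonpath='{.spec.containers[0].env[?(@.name=="KC_HOSTNAME")].value}')
#   hostname_matches "https://keycloak.apps.sre.example.com" "$kc_hostname" \
#     || echo "KC_HOSTNAME ($kc_hostname) does not match the external URL"
```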

OIDC client misconfigured

  1. Access the Keycloak admin console (via port-forward or the Istio gateway).
  2. Navigate to the SRE realm -> Clients.
  3. Verify that each client has the correct:
     - Valid Redirect URIs
     - Web Origins
     - Client secret (it must match what is configured in the consuming service)
  4. For Grafana, the client configuration should be:

| Setting | Value |
| --- | --- |
| Client ID | grafana |
| Valid Redirect URIs | https://grafana.apps.sre.example.com/* |
| Web Origins | https://grafana.apps.sre.example.com |
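
To illustrate how a Valid Redirect URIs entry with a trailing wildcard relates to the URL an application actually sends, here is a simplified sketch. `redirect_uri_allowed` is a hypothetical helper that only approximates the matching (exact match, or prefix match when the pattern ends in `*`); Keycloak's own validation is more involved. The `/login/generic_oauth` path is Grafana's generic OAuth callback.

```shell
# Hypothetical helper: succeed when the URI matches the pattern.
# A trailing '*' in the pattern means "any URI with this prefix";
# otherwise the URI must match exactly.
redirect_uri_allowed() (
  pattern="$1"
  uri="$2"
  case "$pattern" in
    *'*')
      prefix="${pattern%\*}"   # strip the trailing wildcard
      case "$uri" in
        "$prefix"*) return 0 ;;
      esac
      return 1
      ;;
    *)
      [ "$uri" = "$pattern" ]
      ;;
  esac
)

# Usage:
#   redirect_uri_allowed 'https://grafana.apps.sre.example.com/*' \
#     'https://grafana.apps.sre.example.com/login/generic_oauth' && echo allowed
```

A scheme mismatch (http instead of https) fails this check, which is a common reason for a rejected redirect when TLS terminates at the gateway.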

Keycloak database corruption

  1. Check whether the Keycloak logs show database errors:
kubectl logs -n keycloak keycloak-0 --tail=200 | grep -i "database\|postgres\|sql"
  2. Check the PostgreSQL pod logs:
kubectl logs -n keycloak -l app.kubernetes.io/component=postgresql --tail=100
  3. If the database is corrupted and persistence is disabled (lab environment), restart both Keycloak and PostgreSQL:
kubectl delete pod -n keycloak --all
  4. If persistence is enabled, restore from the most recent Velero backup of the keycloak namespace.
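
Because the two recovery paths differ sharply, it helps to confirm whether persistence is enabled before deleting anything. One rough signal is whether the namespace has any PersistentVolumeClaims. The sketch below assumes that any PVC in the keycloak namespace belongs to the bundled PostgreSQL; `choose_db_recovery` is a hypothetical helper.

```shell
# Hypothetical helper: read 'kubectl get pvc --no-headers' output on stdin
# and print which recovery path applies.
choose_db_recovery() (
  if grep -q '[^[:space:]]'; then
    echo "persistence enabled: restore from the Velero backup"
  else
    echo "no PVCs found: deleting the pods resets the database"
  fi
)

# Usage (run manually against the cluster):
#   kubectl get pvc -n keycloak --no-headers 2>/dev/null | choose_db_recovery
```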

Emergency: bypass Keycloak for platform access

If Keycloak is completely down and you need access to the platform UIs, use the local admin accounts:

  1. For Grafana:
     - Username: admin
     - Password: prom-operator (or check the grafana-admin-credentials secret)
  2. For Harbor:
     - Username: admin
     - Password: check the harbor-core secret (HARBOR_ADMIN_PASSWORD field)
  3. For NeuVector:
     - Username: admin
     - Password: admin (default)
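
The secrets mentioned above can be read with kubectl. This is a sketch under assumptions: `get_secret_field` is a hypothetical helper, the `harbor` and `monitoring` namespaces are guesses about where the secrets live, and the `admin-password` key in grafana-admin-credentials is assumed rather than confirmed.

```shell
# Hypothetical helper: print one decoded field of a Kubernetes secret.
# Arguments: namespace, secret name, key.
get_secret_field() (
  kubectl get secret "$2" -n "$1" -o "jsonpath={.data.$3}" | base64 -d
)

# Usage (run manually against the cluster; namespaces and the Grafana key are assumptions):
#   get_secret_field harbor harbor-core HARBOR_ADMIN_PASSWORD
#   get_secret_field monitoring grafana-admin-credentials admin-password
```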

Prevention

Escalation